Transferring data from CSV files to SQL databases is a common task in data management. In this blog post, we’ll show you how to use Python to convert CSV data into an SQL database. Using libraries like pandas and sqlite3, you can easily automate this process, making your data workflow more efficient. Follow along to learn the essential steps and code snippets needed to seamlessly migrate your data from CSV to SQL.
We’ll be using the DC character dataset available on Kaggle.
You don’t need to install sqlite3 separately; it has shipped with Python since version 2.5.
We just import it along with pandas (pandas does need to be installed, e.g. via pip):
import sqlite3
import pandas as pd
Connecting to the SQLite3 database
If you’re using a Mac, SQLite comes pre-installed on your system. If you don’t already have sqlite3, you’ll have to install it. But don’t worry, the installation process is quite simple.
You don’t actually need to create the database file yourself. When you call the sqlite3.connect() function, it automatically creates the database if it doesn’t already exist in the folder, and you’ll see the file appear there under the name you gave it. Now we can connect to the database from Python with a simple command:
conn = sqlite3.connect('characters.db') # Connecting to the database
cursor = conn.cursor() # Object to run queries
Creating the table
To create an SQLite table, we’ll need to know all the columns beforehand. Pandas can help us with this: load in the CSV and run the info() method.
df = pd.read_csv('dc-wikia-data.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6896 entries, 0 to 6895
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   page_id           6896 non-null   int64
 1   name              6896 non-null   object
 2   urlslug           6896 non-null   object
 3   ID                4883 non-null   object
 4   ALIGN             6295 non-null   object
 5   EYE               3268 non-null   object
 6   HAIR              4622 non-null   object
 7   SEX               6771 non-null   object
 8   GSM               64 non-null     object
 9   ALIVE             6893 non-null   object
 10  APPEARANCES       6541 non-null   float64
 11  FIRST APPEARANCE  6827 non-null   object
 12  YEAR              6827 non-null   float64
dtypes: float64(2), int64(1), object(10)
memory usage: 700.5+ KB
We’ll need to add each of these columns to our query. For now, we’ll just assume all of them are strings.
Since the FIRST APPEARANCE column has a space in it, we’ll need to replace it with something else, like an underscore.
table_name = 'characters'
columns_with_types = ", ".join([f"{col.replace(' ', '_')} TEXT" for col in df.columns]) # Eg: page_id TEXT, name TEXT, urlslug TEXT, ...
create_table_query = f"CREATE TABLE IF NOT EXISTS {table_name} ({columns_with_types});"
cursor.execute(create_table_query)
cursor.fetchall()
I’ve decided to name my table characters. Within the query, we first use the concise CREATE TABLE IF NOT EXISTS command, which creates the table only if it doesn’t already exist. Then, for each of the columns in the dataframe, we build the string “<column_name> TEXT”, using Python’s replace method to swap the space in the column name for an underscore. This is in line with the SQLite syntax for creating a table:
CREATE TABLE <table_name> (
page_id TEXT,
name TEXT,
urlslug TEXT,
...);
Finally, we run the command and our table should be created. To check that it has been, run the following:
cursor.execute('pragma table_info(characters);')
cursor.fetchall()
You will see the following:
[(0, 'page_id', 'TEXT', 0, None, 0),
(1, 'name', 'TEXT', 0, None, 0),
(2, 'urlslug', 'TEXT', 0, None, 0),
(3, 'ID', 'TEXT', 0, None, 0),
(4, 'ALIGN', 'TEXT', 0, None, 0),
(5, 'EYE', 'TEXT', 0, None, 0),
(6, 'HAIR', 'TEXT', 0, None, 0),
(7, 'SEX', 'TEXT', 0, None, 0),
(8, 'GSM', 'TEXT', 0, None, 0),
(9, 'ALIVE', 'TEXT', 0, None, 0),
(10, 'APPEARANCES', 'TEXT', 0, None, 0),
(11, 'FIRST_APPEARANCE', 'TEXT', 0, None, 0),
(12, 'YEAR', 'TEXT', 0, None, 0)]
Now that our table has been created, we can iterate over the dataframe’s rows and add them one by one.
for index, row in df.iterrows():
    values = ", ".join([f'"{row_item}"' for row_item in row])
    insert_sql = f"INSERT INTO {table_name} ({', '.join(df.columns.str.replace(' ', '_'))}) VALUES ({values})"
    cursor.execute(insert_sql)
We use the df.iterrows() method to go through each row. Then we make a list of all the row items, ensuring that each item is a string surrounded by quotation marks. Then we use the INSERT command to add all the values to the table.
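A quick caveat: string-interpolating values like this is fragile. A name containing a double quote would break the query, and it’s the classic SQL-injection pattern. sqlite3 supports ? placeholders that handle quoting for us, and executemany() inserts all rows in one call. Here’s a minimal self-contained sketch of that approach, using a toy two-column dataframe in place of the full CSV:

```python
import sqlite3
import pandas as pd

# Toy stand-in for df = pd.read_csv('dc-wikia-data.csv');
# note the embedded double quote, which would break the interpolated query
df = pd.DataFrame({'name': ['Batman (Bruce Wayne)', 'Superman "Clark"'],
                   'FIRST APPEARANCE': ['1939, May', '1938, June']})

conn = sqlite3.connect(':memory:')  # in-memory database for the demo
cursor = conn.cursor()

# Create the table exactly as before
columns_with_types = ", ".join(f"{col.replace(' ', '_')} TEXT" for col in df.columns)
cursor.execute(f"CREATE TABLE IF NOT EXISTS characters ({columns_with_types});")

# One "?" placeholder per column; sqlite3 does the quoting for us
placeholders = ", ".join(["?"] * len(df.columns))
columns = ", ".join(df.columns.str.replace(' ', '_'))
insert_sql = f"INSERT INTO characters ({columns}) VALUES ({placeholders})"

# executemany inserts every row in a single call
cursor.executemany(insert_sql, df.itertuples(index=False, name=None))
conn.commit()

cursor.execute('SELECT COUNT(*) FROM characters')
print(cursor.fetchall())  # → [(2,)]
```

The same pattern drops straight into the loop above: replace the per-row execute() with a single executemany() over df.itertuples(index=False, name=None).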
To see if we’ve gotten all the rows, let’s check the number of rows in both the dataframe and the sqlite table.
df.shape # Returns (6896, 13)
cursor.execute('SELECT COUNT(*) FROM characters')
cursor.fetchall() # Returns [(6896,)]
So it seems that our code works. Finally, don’t forget to commit the changes and close the connection.
conn.commit()
conn.close()
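As an aside, pandas can do the table creation and the insertion in a single call with DataFrame.to_sql. Unlike our manual approach, it infers column types itself rather than forcing everything to TEXT. A minimal sketch, again with a toy dataframe standing in for the real CSV:

```python
import sqlite3
import pandas as pd

# Toy stand-in for df = pd.read_csv('dc-wikia-data.csv')
df = pd.DataFrame({'name': ['Batman', 'Superman'], 'YEAR': [1939.0, 1938.0]})
df.columns = df.columns.str.replace(' ', '_')  # a no-op here, but needed for FIRST APPEARANCE

conn = sqlite3.connect(':memory:')  # in-memory database for the demo

# to_sql creates the table and inserts every row in one call;
# if_exists='replace' drops and recreates any existing table of that name
df.to_sql('characters', conn, if_exists='replace', index=False)
```

The manual route is still worth knowing, since it gives you full control over the schema.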
Conclusion
In conclusion, converting a CSV file to an SQL database using Python and sqlite3 is a straightforward process that can be highly automated. By leveraging the pandas library for data manipulation and sqlite3 for database operations, you can efficiently transfer data between these formats. If you’re having trouble running the code, here’s the link to the IPython notebook.